In November 1972, educators from several parts of the United
States met at the University of North Dakota to discuss
some common concerns about the narrow accountability ethos
that had begun to dominate schools and to share what many
believed to be more sensible means of both documenting and
assessing children's learning. Subsequent meetings, much
sharing of evaluation information, and financial and moral
support from the Rockefeller Brothers Fund have all
contributed to keeping together what is now called the North
Dakota Study Group on Evaluation. A major goal of the
Study Group, beyond support for individual participants
and programs, is to provide materials for teachers, parents,
school administrators and governmental decision-makers
(within State Education Agencies and the U.S. Office
of Education) that might encourage re-examination of a
range of evaluation issues and perspectives about schools
and schooling.

Towards this end, the Study Group has initiated a
continuing series of monographs, of which this paper is
one. Over time, the series will include material on,
among other things, children's thinking, children's
language, teacher support systems, inservice training, and
the school's relationship to the larger community. The intent
is that these papers be taken not as final statements--a
new ideology--but as working papers, written by people
who are acting on, not just thinking about, these problems,
whose implications need an active and considered response.
The widespread use of tests for purposes of selection,
for deciding from Kindergarten on up who will
pass and who will fail, who will be winners and who
will be losers, is not likely to go away in a hurry.
For, whether we like it or not, it has become indigenous
to the kind of competitive culture that characterizes
our social institutions, including our educational
institutions.
Henry S. Dyer



Introduction 



The common practice of assigning the task of making
judgments about programs or children's achievements to
outsiders stems from a desire that evaluation be 'objective',
exempt from local influences, and applicable to any number
of different situations. This concern with objectivity
can be traced to efforts in the social sciences to attain
the objectivity of the physical and biological sciences,
where, according to popular belief, a description
of an experiment by one group should be such that any
other competent scientist can repeat the experiment; where
the description of what happened should be so divorced
from local events, or time-dependent parameters (except
for events in time), that any other competent person could
repeat them.

Although such a position is widely accepted in the
physical sciences as appropriate in principle,
scientists know that many times they cannot repeat other
experiments, or that it simply is not worth the bother to
make the effort to duplicate them. What is required in
physical science work is: first, that the results reported
be consistent with accepted theory and, secondly, that
the results reported or the compounds prepared show the
properties that one would expect of such materials in the
common course of events.

In evaluating educational work, the public can also
expect, first, that the evaluation effort be consistent
with the practice assessed: that it show reasonable results
or ascribe properties to the educational system or
outcomes consistent with what is generally known about
children and learning; and, second, that the description
of the evaluation activities and their relation to the
program are such that other interested parties can carry
out similar activities or compare them with their own
experience.

It is naive, however, to assume that an evaluation
is objective simply because it is carried out by someone
not connected with the ongoing activity. Certainly people
with a stake in a program must work out some way to
recognize their own enthusiasm and self-interest in what
happens and therefore take measures to minimize this
influence. But a realistic recognition of this problem is more
likely to result in relatively objective evaluations than
does reliance on outsiders. 'Professional' outside
observers are still human and therefore as open to the
same problems of honesty and objectivity as anyone else.

But the problems surrounding evaluation are greater
than any methodological issue. Each kind of educational
philosophy requires its own approach to evaluation. An
analysis of evaluation has to ask what the goals of the
program are, and how any evaluation strategy supports and
influences the program. In other words, evaluation must
be considered both for itself and for its impact on the
total program, not as a separate activity carried on
outside the confines of the rest of school. Like curriculum,
teacher training, and school organization, evaluation
activities are an integral part of school, influencing
every other part.
Open Education Principles
Relevant to Evaluation



I can establish some sense of the evaluation and measurement
strategies suitable to open education by listing
briefly some of the principles on which the open education
movement is based.



DEVELOPMENTAL ISSUES 

First and foremost, open education includes the belief
that the individual growing child is educable and stands
at the center of the educational process. This idea has,
of course, been the rallying cry of educational thinkers
in the liberal tradition for hundreds of years. The
statement takes on new meaning, however, when current
knowledge of children's growth and development is added
to it.

It is now recognized not only that children are
different from adults, but that children differ from each
other; they go through developmental stages at varying
rates and with varying learning styles. Child development
experts no longer speak about 'the six-year-old',
but about the range of activities, ideas and concepts
which different individuals in a group of six-year-olds
will exhibit. Educators and parents have to ask whether
the evaluation programs currently available in our schools
take into account this variability in development.

Another difference in learning style among children
is in the 'horizontal' dimensions of their growth and
development (Bussis and Chittenden, 1973). Individual
children not only reach different stages of development at
various times in their lives, they also need to spend
various amounts of time confirming and internalizing those
stages. Most people know the distinction between learning
the meaning of a new word and being able to use it
unselfconsciously in speech. There is always some span of
time when one tentatively tries out the word, perhaps plays
with it a little, listens to the sound in a sentence, and
sees its effect on others before one can safely and
naturally use a new expression. This learning time will vary
from word to word, from situation to situation, and
according to the need to use the word. If one watches
children learn to speak, this phenomenon is apparent. The
same pattern occurs, of course, with all the concepts and
ideas anyone acquires.



Horizontal dimensions of learning do not show up as
a mastery of greater numbers of words or knowledge of more
concepts. Instead they manifest themselves in the richness
of associations that a child is capable of, in the
variety of ways that a concept or word can be used. They
are an important part of the learning process, although
seldom measured in evaluation procedures. Any educational
program that takes individual children seriously has to
take this horizontal component of learning into account.

Along with the recognition of children's individual
rates and styles of growth comes a reluctance to put rigid
timetables on the acquisition of skills or knowledge.
In recent years a few psychologists have argued that unless
children learn to read by the age of six or seven or
acquire other school-related skills in the early years,
they will be permanently behind. (Much of this argument
stems from the 'cultural deprivation' perspective on
poverty, which claims that the difference between poor and
rich is that the 'culture of poverty' does not include
the appropriate formative experiences children need to
benefit from school.) But this view has recently been put
into question by findings of societies where young children
are kept very quiet and inactive yet grow into active,
intelligent adults (Kagan, 1972). Given such divergent
evidence, it is more productive to look at the variation
among individual children, to study their styles
and their growth, than it is to try to generalize about
the maximum necessary conditions for rapid attainment of
skills and accomplishments.

Finally, advocates of open education believe strongly
that the way in which children can progress through the
various stages of development to more adult understandings
of the world around them is through exposure to that
world. Like good physical growth, which is only partly
determined and requires nourishing food and regular use
of muscles and limbs, mental, emotional, and social growth
requires constant, active involvement with the rest of
one's immediate world. Children's understanding of the
physical world comes about as they play with things,
observe them, manipulate them and generally begin to affect
them. Children also learn about themselves and other
people, about feelings, about cooperation or competition,
by being in social situations and exploring their
ramifications. This interaction with the things in the world
is not only important for learning about these things, it
is essential if children are to learn how to learn about
them.



SOCIAL ISSUES 

Another set of beliefs about children, social in origin,
makes the findings of developmental psychology especially
significant. A belief in the educability of all children,
and the recognition of the individual qualities of each
child, is essentially a belief in the value of each
individual--child or adult. It follows then that if we respect
each individual, we must be concerned with his or her
personal growth.

To provide maximum opportunities for each individual
to grow and develop most fully, it is necessary to minimize
the social influences that prevent the attainment of
these goals, and to do everything possible to avoid
damaging or stifling situations. If there is a question of
nutrition or physical health, educators who are concerned
for the growth of individuals will do all they can to see
that children are well fed and healthy. Likewise, educators
must also combat social diseases that threaten to
harm the individual child: forms of prejudice or
stereotyping that force children into roles or categorize them
independent of their individual qualities. Racism and
sexism, among the most intense forces in our society, or
any practice or tendency that categorizes children
arbitrarily by some external factor, robs them of some portion
of their ability to grow and to learn. Stereotyping gets
in the way of seeing a child as an individual, interferes
with providing experiences for a child from the whole
range of life, and diminishes the opportunity to follow
an individual timetable of horizontal and vertical
growth.

Once children are placed in categories according to
some specific quality they evidence at a particular time,
there is an inevitable ordering of those categories and a
decrease in the respect which is shown to those children
during their development. Two major components of this
trend can be ascertained. First, there is an inevitable
ranking of children by the style they show. The most
common practice in schools, which invariably places
children into categories judged in an order from 'best' to
'worst', is tracking--the setting up of streams for the
children on the basis of the rate at which they attain
certain skills. I know of no system of tracking which is
not also accompanied by a value judgment about which are
the best children and which are the worst. Yet even the
slightest knowledge of developmental theory should tell
us that how fast one learns something has very little to
do with how well one learns it or how much one can learn.
The almost universal outcome of sorting children is that
they stay in the group into which they are placed, despite
the biological evidence that such categorization at
a particularly early age should have little to do with
later achievement.

This persistence of tracked groups, whereby the
initially 'slow' students tend to remain slow, despite the
fact that social or physical development may speed up, is
perhaps among the best evidence that tracking students is
not simply a convenient pedagogic device but results in
self-fulfilling and often damaging value judgments.
There is absolutely no reason to assume that a child who
learns to read late will be a poor reader, just as there
is no reason to assume that a child who learns to speak
later than another will be a poor talker. There are
many parents who have discovered that a late start in
talking has not prevented their child from becoming a voluble
and constant chatterer later on. If all of those students
who start slowly in certain school skills stay slow, it
suggests that the result is due more to their school
classification than to biological necessity.

A second related point is the common knowledge about
the correlation between social behavior and school track.
By and large, the highest tracks in a school--that is, the
fastest children--are usually the best behaved, while the
slowest are the poorest behaved. Again, there is no reason
to believe that the slow development of mental processes
alone has any relationship to behavior. No one
assumes that children who are slow to learn to walk or
talk, or who grow slowly, present more serious behavior
problems than those who do this rapidly. If children who
happen to be slower in development of reading skills than
others, for example, end up behaving badly, the reason may
well lie in the way they are treated because of this
developmental trait.

A respect for children and a desire to see them develop
to their fullest potential also requires cooperation
and mutual interaction in learning, rather than
competition and isolation. Educational environments should
maximize opportunities for children to become comfortable
with the world, to face it, to structure and order it.
To the extent that schools rigidly clarify subject and
process for children, they deny children the experience
they need in order to organize the world. If we accept
that experience is necessary in order to understand the
world, then schools must endeavor to help children to
understand the connections between things by enabling them
to make these connections. For example, there is a
relationship between spelling and writing and reading, but
it is a lot harder to understand what it is if these
'subjects' are always taught in isolation or in a particular
order. There are certainly connections and tremendous
overlap between art and science and crafts, connections
which are variable and of different significance to
particular people. But the only way for any of us to be
able to make those connections for ourselves is to have
the opportunity to make cross-links through our work.

In a similar manner, it is only possible to learn
about the social world by participating in it. School--
if it is an educational institution--must support social
interaction. This means fostering cooperation, sharing,
assistance and all forms of social relations between
people: children with children, children with adults,
and adults with adults in ways which allow the individuals
concerned to get to know each other better and to
learn cooperatively from each other. Most competitive
situations have just the opposite effect: they draw
people into themselves, encourage them to become
suspicious of others, to keep things to themselves--in short,
they isolate people from each other. This isolation
discourages children from learning and growing.






Competitive situations also are inconsistent with
appreciation for individual differences in growth and style.
Some children read faster than others but not therefore
necessarily better. (The same is true of adults: reading
speed has very little correlation with intellectual
training or ability, or, for that matter, with retention of
what is read.) Some children are good at spelling out
loud, others are poor at it. Most important, some children
can do any number of these things well at some times
and not at others. The point is that stress on isolated
measurement of particular skills does not really enrich
our knowledge of children's growth. Instead it tends to
make us stereotype children in respect to a few properties
and to forget to ask what they are like as people and what
other strengths, weaknesses, and interests they may have.



General Position Statement on Evaluation



1. Children go through stages of physical, mental,
emotional and social growth, and it is important to know
where the children are at a given point, and what one may
expect next. In many instances, these stages can be
expressed in a quantitative manner. There is no point in
saying that Susan is short or that she is just so high;
one can readily report that she is 42 inches tall.
Likewise, abilities to read or compute can be described with
some precision. However, it is always important to
recognize just what those measures mean. Being 42 inches high
at age six is a fact, but one that has little relation to
worth or general ability, or even to how high Susan can
jump or whether she can run fast. Likewise, a sight
vocabulary score is just that and nothing more.

Perhaps most important to stress in discussion of
quantitative measurement is that we are interested in
these measurements because of what they tell us about that
child, not because they permit easy comparisons. This
point is at the heart of all discussions about evaluation
and measurement. Any statement about a child's achievement
and level tells something about that child, and is
significant in a description of that child. So first and
most obvious, evaluation results should not be formulated
in terms of averages, but in terms of individual results.
Or, to put the same statement in somewhat more technical
language, what is interesting in any group measurement is
not the mean but the variance.
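
The point about mean and variance can be made concrete with a small sketch. The scores below are invented purely for illustration (they come from no study or test), and the names group_a, group_b and summarize are hypothetical. Two groups of six children share the same average, yet the spread of individual results is entirely different:

```python
from statistics import mean, pvariance

# Hypothetical sight-vocabulary scores for two groups of six children.
# These numbers are invented for illustration only.
group_a = [48, 49, 50, 50, 51, 52]
group_b = [20, 35, 50, 50, 65, 80]


def summarize(scores):
    """Return the group mean and the population variance of the scores."""
    return mean(scores), pvariance(scores)


mean_a, var_a = summarize(group_a)
mean_b, var_b = summarize(group_b)

# The averages are identical, so a report of averages alone would
# treat the two groups as the same.
print("means:", mean_a, mean_b)

# The variances differ widely; only the individual results show that
# the children in group_b are at very different points.
print("variances:", var_a, var_b)
```

A report that gave only the average would hide exactly the individual differences that matter to the people working with each child.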

It is also important to remember that in the assessment
of any evaluation effort, what has actually been measured
must be clearly stated. A measurement of height or
weight is direct and we know what it means; many educational
evaluations are not. For example, students who can
read quite well and comprehend what they read may receive
a low score on a reading test if they do badly on some
technical sections, such as blending, syllabification, and
auditory discrimination (Allen, 1974).

2. Evaluation practices must respect the setting in
which the educational effort takes place. That is, it is
necessary to adapt the evaluation to the program rather
than vice versa. When the educational endeavor is one
which advocates learning through interaction with the
world and through social interaction, it becomes
particularly important that the measurement of children's growth
also involves the 'stuff' of the world and permits social,
cooperative interaction. The whole process of evaluation
must also take into account the effect of the testing
itself on the school setting and on the whole program. This
is important both in the measurement of children's
progress and in the evaluation of programs.

3. All evaluation efforts should recognize the
distinction between saying and doing, between verbal
knowledge and ability to use information. If I want to find
a good mechanic for my car, I usually don't ask the
mechanic to describe the internal combustion engine. If I
want an electrician, I don't ask for a definition of
electricity; I want to find out about the work of these
people. Similarly, evaluation of children's work should
take into account the doing of that work, not merely
descriptions of it.

4. The value of any evaluation is in direct proportion
to its usefulness, to how much it can help a child's
education. If there is any measure of what a child can
and cannot do, then this should be in a form and at a
time when it can directly assist the people who are working
with that child.

To summarize, any evaluation of children's performance,
whether quantitative or qualitative, should stress
the individual results rather than make comparisons,
express these results in a manner that is useful to the
people involved, relate it to the particular educational
setting, and recognize that children are complex beings
with a wide set of attributes and influences.



Present Status of Educational Evaluation



Evaluation is making judgments about a process;
educational evaluation involves making judgments about a social,
public activity. Examples of evaluation questions are:
Is this school adequate? Is this a good teacher (principal,
administrator)? Is this child making reasonable
progress? Should we use curriculum "a" or "b"? The way to
arrive at these decisions is to use the best and the most
information possible. One major source of information is
some kind of measurement; or, to put it the other way
around, a particularly useful way of gathering data
necessary for making judgments is to make measurements, to
collect data, and to present it in some orderly form.
But obviously this is only one component of responsible
decision-making.

The passage of the Elementary and Secondary Education
Act in 1965 marked the first major direct introduction
of federal funds into the public school systems.
Much of this money was allocated for programs directed
towards poor and minority children. The advent of this
intense effort of federal money spent on the schools
brought with it a sudden cry for 'accountability'--for
finding out whether the money spent was doing any good.
Although it is easy to understand why questions should be
raised about the expenditure of federal money for
education, as in other areas, it is worth noting that such
questions were first seriously raised only when money was
beginning to be spent in poor districts and to alleviate
the educational shortcomings of poor and oppressed
students.

Also the nature of the questions was of a very
interesting kind. The major issue was not whether the
money was spent as the law required, that is, specifically
for reading improvement or for the arts or for bilingual
education, but whether the money was actually solving
problems that existed in the schools. In other words, the
stress on evaluating programs focused on the results that
might arise, not on the way the money was expended. A
comparable situation would be if the massive highway fund
had been evaluated in terms of whether it solved the
transportation problems of the United States, rather than
whether the money had actually been spent on highway
construction, labor, cement, steel, etc.

In response to the outcry for evaluation, the
educational community brought to bear its best and brightest
minds concerned with evaluation. The field became more
visible, and considerable amounts of writing and
prescription followed. In 1967, the American Educational Research
Association (AERA) began to publish a series of pamphlets
on the subject of evaluation. Several of the articles in
the first issue discussed 'professional' evaluation and
urged strongly that evaluation studies not be left in the
hands of amateurs but entrusted to professionals.

It is interesting to think about who the counterparts
to 'professional' evaluators would be in other
fields of human endeavor. When the stress is put on the
measurement part of the work, as it often is in the
literature of evaluation, then it is tempting to think of
evaluators as the analysts of the field. By this
definition, they are the people who do work comparable to
chemists who analyze compounds for their elemental components:
the amount of carbon, hydrogen, and nitrogen in a
compound. But the analytical chemists by themselves do not
make judgments. They simply follow a procedure
established by practicing chemists and report results from the
procedures; they certainly do not, or should not, make
value judgments. A chemist would be surprised and
annoyed if she received a report from an analyst which said,
"the compound you sent me contained 67 percent carbon, 8
percent hydrogen, 19 percent nitrogen and isn't worth
reporting in the literature!"

In education, professional evaluators have a role
which is much more like management consultants, consumer
advocates, or any of that range of people who try to look
at some social activity critically and then make
judgments about it. If there is one thing we know about this
whole kind of activity, it is that although good judgments
require careful collection of data and measurement as a
necessary activity, this is hardly sufficient. In fact,
when the preoccupation with details and data gets too
great, then the most important issues can be forgotten.
In The Best and the Brightest, a book about the Vietnam
war, David Halberstam points out the limits of great
professionals hard at work on a social problem. In
describing the decision-makers in the Pentagon during the war,
Halberstam reports that while there was much analysis,
great gathering of data, body counts, and constant reports
from Vietnam, some simple general truths tended to be
forgotten, and crucial questions remained unexamined.
Consequently the data proved over and over that we were
winning the war despite the contrary evidence.

A second issue concerning professional evaluators
has to do with their background. If there is such a thing
as professional evaluation, then the members of this
profession must have been trained somewhere, or must at least
have some identity as a profession and some views and ideas
they share as members of this profession. By and large,
people who call themselves professional evaluators in
education have been trained in social science research in
universities. Most of them come from 'educational psychology'
departments or similar departments with other names. Their
reference point is the American experimental psychological
tradition, especially as practiced in the field of
education.

Educational psychology, like every field of science,
has its own style of operation and its own way of defining
experiments, goals, and approaches to problems, even its
own style of defining what a problem is. What appears to
be a reasonable approach in one field, however, is just
not acceptable in another. American experimental
psychology, with its strong behavioristic strain, has developed
a particular scientific tradition, with its own norms,
methods, and goals. But this is simply not appropriate to
the entire range of human activity. Schools are social
institutions carrying on a complex social and cultural
activity; they are not experimental laboratories in which
controlled conditions can be established and isolated
events studied relatively separate from their surroundings.

For some years I taught chemistry and biochemistry
in a large urban university. I taught undergraduate
courses in organic chemistry and supervised graduate
students doing biochemical research with enzymes. We
published papers in respectable journals, received federal
financial support, and the students who worked with me
successfully competed for scholarships and professional
recognition. Yet a few of my colleagues in the department
consistently told me that I wasn't doing 'real'
research, that biochemistry wasn't a 'real' science, and
that my students weren't getting 'good' or sufficient
training. The only way to satisfy these colleagues (who
were in the more physical end of chemistry) would have
been for us to give up our particular interest and to
adopt theirs, along with the techniques, the particular
mathematical tools, and the general styles of approach
which appeared to them as the only appropriate ones. Of
course, my critical colleagues in physical chemistry were
being told by some physicists that all chemistry was just
a minor, imperfect, and not very important branch of
physics, and that only the physicists were doing 'real'
science! As Kurt Vonnegut would say, "so it goes."

The experimental psychological tradition is at some
point in this spread of the range of science, with its
own particular models and techniques. Whether academic
research in experimental psychology is the best model for
evaluation studies is a question. Before going into it,
however, we have to ask whether any traditional academic
research style is an appropriate model.

The moral dilemmas that enter into 'pure' scientific
research, such as basic physics or chemistry, were made
painfully obvious to society by the events surrounding
the Second World War. Recently researchers in biology
have argued that certain experiments simply should not be
carried out until the possible social consequences are
evaluated. The moral and social questions are constantly
troublesome in all uses of social science. Unfortunately,
traditional descriptions of scientific method are
based on views that fail to take these factors into
account. For centuries, scientists have developed a style
of 'objectivity' and a set of methodologies that have
ignored the social implications of research.

This discussion assumes that academic research is
carried out by a specific set of rules and that evaluation
work also follows these rules. This is a fairly
standard textbook view of science and of activities of all
sorts: that there are some right ways of doing it and
some wrong ways, and that people who carry out the work
do it correctly or else are frauds or failures. Of
course the world of actual practice doesn't work that
way. There are some 'proper procedures', some 'correct'
ways of carrying out anything, whether it is repairing
cars, running a factory, or doing research. But these
correct ways change with time, and more important, anyone
who does work well knows there are times when you simply
throw the rules out the window and do whatever you have
to do to get the job done.

Moreover, particularly significant measurements
sometimes require new instruments. Part of Galileo's
problem in convincing people of his evidence for the
organization of the solar system was that he was using a
whole new measurement technique: he was looking at the
moons of Jupiter through a telescope. Was that a legitimate
measurement device? People had to decide whether it was
or not. Similarly, various indirect ways of looking at
nature--measuring electrical charge, spectroscopy, radio
waves--had to be accepted as legitimate measurement
devices. In many cases, the advent of a new bit of science
or technology required that the new way of measuring also
had to be invented and then accepted as part of the proper
instrumentation.

Some Characteristics of
Evaluation Paradigms 



THE DOMINANT MODEL

It is worthwhile to look at the general nature of the
research design methodology that dominates measurement in
the field of educational evaluation in order to decide
just how relevant (or inappropriate) it might be.* The
general model comes from that branch of psychology that
attempts to model itself on research which proved
particularly successful in 18th century physical science,
and was then applied in the 19th century to more practical
problems, and to areas which needed a little manipulation
in order to meet the same criteria for research. Each field
of science has particular methodological issues which are
difficult, and others which are relatively simple,
depending on the nature of the subject matter. In
observational astronomy, for example, it is relatively
simple to carry out and standardize repeated direct
observations. The phenomena in the sky are uniquely there.
They repeat themselves, and all one has to do is sensibly
and patiently observe. Also, the phenomena are accessible
everywhere on the earth, relatively stable for centuries,
and similar over large areas of the earth, so that checking
observations from one point in space or time to another
point in space or time is quite easy. On the other hand,
some kinds of experimental work in astronomy are virtually
impossible. Bits of the heavens cannot be isolated in
order to take them into the laboratory and change the
conditions to see what happens.

One of the triumphs of late 19th and early 20th
century science is the devising of mathematical techniques
and experimental tools for work in fields where the actual
number of individual bits of experimental materials is not
vast as it is in chemistry, or regular and beyond reach as
in astronomy, but relatively small in quantity and able to
be manipulated. Some of the best work in this area was in
the field of agricultural research, and the classic studies
in research design now widely applied in the social
sciences still refer to these methods. A standard reference
is the work of Fisher (1935), who summarized the
methodology recommended in the 1930s.

2 See also Michael Patton's Alternative Evaluation Research
Paradigm in this series.

In this approach, an experiment is defined in terms
of taking two populations, selected at random, doing
something to one of them, using the other as a control, and
comparing the two before and after the treatment. By
implication, this approach becomes the method of doing
research, the only method of arriving at new or certain
knowledge.
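
As a rough illustration of the design just described, the
following Python sketch randomly divides a set of units into a
treated group and a control group, measures both before and
after, and compares the average gains. The function name, the
`treatment` and `measure` callables, and the example numbers are
hypothetical, chosen only to make the logic concrete; they are
not taken from Fisher.

```python
import random
from statistics import mean


def run_two_group_experiment(units, treatment, measure, seed=0):
    """Randomly split units into treated and control groups and
    compare the average before/after change in each group."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)          # random assignment to the two groups
    half = len(shuffled) // 2
    treated, control = shuffled[:half], shuffled[half:]

    # Measure both groups before anything is done.
    before_treated = [measure(u) for u in treated]
    before_control = [measure(u) for u in control]

    # Apply the treatment to one group only; the control is left alone.
    after_treated = [measure(treatment(u)) for u in treated]
    after_control = [measure(u) for u in control]

    treated_gain = mean(after_treated) - mean(before_treated)
    control_gain = mean(after_control) - mean(before_control)
    return {
        "treated_gain": treated_gain,
        "control_gain": control_gain,
        # The design attributes this difference to the treatment.
        "estimated_effect": treated_gain - control_gain,
    }


# Hypothetical example: twenty plots whose "yield" is just a number,
# and a treatment that adds two units of yield to every treated plot.
result = run_two_group_experiment(
    units=list(range(10, 30)),
    treatment=lambda y: y + 2,
    measure=lambda y: y,
)
```

Note that everything the sketch reports is a group average; nothing
in it describes what happened to any single unit.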

Because of the nature of the populations available
in social science research, two problems--that of selection
of experimental and control groups, and the relationship
between experimental treatment and results--assume an
enormous significance. I want to discuss this research
paradigm in terms of its relation to educational
evaluation.



*One of my favorite stories about research work concerns a
bright young biologist who wanted to repeat some experiments
carried out by an established researcher on a particular
strain of microorganism that the older worker had isolated
and cultured. The young man wrote to his senior colleague
asking for a sample, and was turned down. He continued to
write, constantly renewing his request, although all his
letters were received with negative responses. When a
colleague asked the young biologist why he kept repeating
his request when he should know that the answer was going to
be "no," he replied, "I know he will refuse me, but his
letters are written near his lab, and every time I get one,
I cut it up into little pieces and see what I can grow from
it on agar plates. Sooner or later, I'll get my organism."
This imaginative and outrageous bit of methodological
strategy just doesn't fit into the models of neat
experimental design.

Applicability

In this research design, considerable energy is
expended devising ways to arrive at a random sample to make
sure that the population studied is some average general
one, not the result of some prejudice or odd local factor
that might influence the result. A good deal of
agricultural research exemplifies this point. If you want
to find out the effect of a particular fertilizer on corn
crops, you must be sure that you don't confuse the effects
of fertilizer with the effects of rain or weeding. Also you
want to know just how great the effect is.

The influence of this particular way of doing
research in educational work is indicated by the writing
in the field. Many theoreticians and methodologists
describe it as if it were the only possible way of doing
research. In "Experimental and Quasi-Experimental
Design," a highly respected outline of research
methodologies (Campbell and Stanley, 1963), the authors
recognize that there are many cases where the 'standard' of
Fisher-type experiments cannot be met, but they make it
clear that such situations are, at best,
'quasi-experimental'.

The basic problem is not with Fisher's or Campbell
and Stanley's definition of an experiment. They are at
liberty to define this as they wish. What is frightening
and limiting is the further suggestion that experiments
defined in this way are the only way to acquire
knowledge, or that, no matter what the situation, every
effort should be made to structure situations so that
experiments of this kind are undertaken.*

An example from the evaluation literature
illustrates the contempt which persons concerned with
educational evaluation have for whole fields of scientific
endeavor. In the AERA monograph (1967) mentioned earlier,
Michael Scriven writes:

We might for example be interested in the proportion
of the class period during which the teacher
talks, the amount of time that the students spend
in homework for a class, the proportion of the
dialogue devoted to explaining, defining, opinion,
etc. (Milton Meux and B.O. Smith, 1961). The
great problems about work like this are to show
that it is worth doing, in any sense. Some pure
research is idle research. The Smith and Meux
work is specifically mentioned because it is clearly
original and offers promise in a large number of
directions. Skinner's attack on controlled studies
and his emphasis on process research are more than
offset by his social-welfare orientation which
ensures that the process work is aimed at valuable
improvements in control of learning. It is difficult
to avoid the conclusion, however, that most process
research of this kind in education, as in
psychotherapy (though apparently not in medicine), is
fruitful at neither the theoretical nor the applied
level. (p. 50)

The implication here is that 'process research', 
that is, field studies based on observational methods, is 
not even worth doing unless it is offset by a particular 
social-welfare orientation. 

This is a rather harsh judgment on a large number
of scientific fields, which might have particular
relevance to evaluation. Anthropologists, archeologists,
ethologists, a whole range of social scientists do not do
'controlled experiments'. The basis of their work is
informed observation, and it is remarkably fruitful. Jane
Goodall (1971) watched a small number of individual
chimpanzees over a period of years, and in the course of her
observations discovered the apes using, and even making,
tools. She could not possibly have developed a control
group experiment; in fact, she would have had no reason
to set up such an experiment, even if it were possible,
because tool-making was not part of the expected behavior
of chimpanzees. Because she is in a field that accepted
open observation without a specific predetermined behavior
being measured, she could make her scientific discoveries.

Another difficulty in trying to apply the
agricultural, experimental research method to evaluation
arises from the fact that evaluation is not a laboratory
research activity. It is performed in the field; that is,
in natural settings--in schools with live children and
adults. This makes the whole problem of randomization
extremely difficult, because the total environment cannot
be manipulated for either experimental or control
groups.

What Data is Generated

The fact that this experimental methodology is
hard to apply is not, of course, sufficient reason to
question it. But one can ask whether the information it
yields is worth all the trouble. The kind of results
that can be obtained by applying experimental designs as
described by Campbell and Stanley give only a small part
of the types of data that are needed to make educational
judgments.* This method focuses on comparison of averages,
means, total sample gains, and generally on trends
which apply to the group as a whole. It is not designed
to focus on individual members of the experimental sample.
For example, in determining the effects of fertilizer on
a crop you look not at individual plants, but at large
fields and the weight of the resultant crop. It may be the
case that a fertilizer produces amazing results by
stimulating 90 percent of the individual plants, although it
kills the remaining 10 percent. This can still make a
fertilizer highly desirable. But imagine a similar
situation in education!
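
The arithmetic behind the fertilizer example can be sketched in a
few lines of Python. The yields below are invented for
illustration (ten plants, nine of which double their yield while
one dies), and the `summarize` helper is a hypothetical name. The
sketch only shows that a group average can rise sharply while
some individuals are harmed.

```python
from statistics import mean


def summarize(baseline, outcomes):
    """Return the change in the group average and the number of
    individuals whose outcome is worse than their baseline."""
    group_gain = mean(outcomes) - mean(baseline)
    harmed = sum(1 for b, o in zip(baseline, outcomes) if o < b)
    return group_gain, harmed


# Hypothetical yields: every untreated plant produces 10 units.
baseline = [10] * 10
# With the fertilizer, nine plants double their yield and one dies.
treated = [20] * 9 + [0]

group_gain, harmed = summarize(baseline, treated)
# group_gain is the figure an averaging design reports;
# harmed is the figure it is not built to notice.
```

An evaluation that reports only `group_gain` would call the
treatment a success; the count in `harmed` appears only when
individuals are examined one by one.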

In some cases, it is useful to obtain the kind of
data that is generated by applying the experimental
psychology model of research. The Plowden Report (1967)
provides an example of a situation where data gathered by
the statistical research methods of 'experimental design'
was exactly what was needed--within a context. One question
the Plowden committee asked was simply: what is the
general level of reading attainment of English children
compared to the same group several years earlier? By
giving standardized reading tests to a fairly small
sample, the committee was able to determine that reading
levels for the whole population had increased over the
time period measured. That is a case where the question
asked was best answered by this kind of impersonal,
averaging procedure. The desired information was general
and impersonal, and it only required sampling a small
fraction of the entire school population. Still the major
work of the Plowden Committee, to analyze the status of
primary education in England and make recommendations for
the future, depended on interviews, observation, and
concrete examples.
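
A minimal Python sketch, under stated assumptions, of why a small
random sample is enough for a question of this general kind. The
population of scores is simulated, and the function name and its
parameters are hypothetical; this is not a reconstruction of the
Plowden committee's actual procedure.

```python
import random
from statistics import mean


def estimate_mean_from_sample(population, sample_size, seed=0):
    """Estimate the population average from a simple random sample."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    return mean(sample)


# Simulated population of 50,000 reading scores (assumed roughly
# bell-shaped around 100, purely for illustration).
_rng = random.Random(42)
population = [_rng.gauss(100, 15) for _ in range(50_000)]

# Test only 1,000 children, i.e. 2 percent of the population.
estimate = estimate_mean_from_sample(population, 1_000)
true_average = mean(population)
```

The estimate answers a question about the population as a whole;
it says nothing about any particular child or classroom.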

View of Causality

Another problem with the style of the dominant
research paradigm is that it is based on a rather naive and
simplistic notion of the nature of causal relationships
in social situations. The basic premise, derived partly
from the behavioristic outlook of many research
predecessors and partly from the kind of methodology that
is advocated, is that there will be fairly direct and
immediate results from particular actions. Introduce
program "a," teaching method "b," or organization of
classroom "c," and it will be possible to see the effects
fairly directly and separately from other events that may
happen. It is easy enough to see how this view can be
applied in the study which served as a model for this type
of research. Plants are relatively simple biological
species, they don't particularly interact with each other,
and they have relatively few degrees of structural freedom.
Therefore, they respond simply and directly to specific
changes in conditions. If you water them more, they grow
more. If you fertilize them, they grow and produce more,
etc. But people just don't respond that simply to stimuli,
especially not in open, natural social settings.

The kind of methodology classified as
'experimental' by Campbell and Stanley is based on the
assumption that the effects of actions can be isolated and
measured fairly directly, and that what is going to happen
can be predicted with enough precision to look for
that result and directly relate it to isolatable
situations, programs, and activities that you wish to
evaluate. It is possible to try to arrange human studies to
approximate the simple conditions of plant life, but
besides the terrible moral issues which then come into
play (see below), to attempt this is to destroy the real
world situation of ongoing school activities. It is this
ongoing work which is the proper subject of evaluation
studies.

Moral Issues

A crucial issue of any research strategy,
especially any which involves living things, is the moral
problem involved. What does it mean to do any sort of
research that includes humans? There appears to be a
common misconception that as long as research follows
proper methodology, such questions are resolved or at least
minimized. Certainly some of the writing about proper
methodology appears to ignore the implications of these
positions. For example, in developing an argument to
show that it is possible to do comparative studies even
in cases where absolute results cannot be obtained,
Scriven states:

The analogy in the medical field is not with drug
studies, where we are fortunate enough to be able
to achieve double-blind conditions, but with
psychotherapy studies where the therapist is obviously
endowed with enthusiasm for his treatment, and
the patient cannot be kept in ignorance of whether
he is getting some kind of treatment. If Cronbach's
reasoning is correct, it would not be possible to
design an adequate psychotherapy outcome study.
But it is possible to design such a study, and the
way to do it--as far as this point goes--is to use
more than one comparison group. If we use only
one control group, we cannot tell whether it's
the enthusiasm or the experimental technique that
explains a difference. But if we use several
experimental groups, we can estimate the size of the
enthusiasm effect. We make comparisons between a
number of therapy groups, in each of which the
therapist is enthusiastic, but in each of which
the method of therapy is radically different. As
far as possible, one should employ forms of therapy
in which directly incompatible procedures are
adopted, and as far as possible match the patients
allocated to each type (close matching is not
important). There are a number of therapies on the
market which meet the first condition in several
dimensions, and it is easy enough to develop
pseudo-therapies which would be promising enough to be
enthusiasm-generating for some practitioners
(e.g., newly graduated internists inducted into
the experimental program for a short period).
The method of differences plus the method of
concomitant variations (analysis of covariance) will
then assist us in drawing conclusions about
whether enthusiasm is the (or a) major factor in
therapeutic success, even though double-blind
conditions are unobtainable. (p. 68)

The implications of having eager, inexperienced,
young internists practice pseudo-therapies on innocent,
but troubled, patients involve serious moral questions.
Similar suggestions have been made and should be resisted
in education. There are cases where children are simply
denied benefits for the sake of setting up a control
group, or where for the sake of completing the
experimental activities children and parents are not fully
informed of a program. Adherence to a 'good' research
design, that is, one that is methodologically sound, does
not even begin to address any of the moral questions that
come up in a particular research activity.



ALTERNATIVE STRATEGIES OF MEASUREMENT



There are other scientific research methodologies that
provide approaches to the collection of data, which are
particularly appropriate for many evaluation studies.
Increasingly in the last few decades research on schools
has used the approach of the anthropologist, the
questioner of culture, to examine what happens and to
describe, tabulate, appraise, and finally, judge or
evaluate education (Kimball, 1972). A good part of this work
was inspired by Jules Henry (1963) and his general
anthropological approach to looking at institutions. Since
then, a number of people have applied similar methods.
Philip Jackson's book, Life in Classrooms (1968), is a
proper, scientific research study, but its research
methods come from a different field than behavioristic
psychological research.

The power of the anthropological approach can be
estimated from the impact that this type of study has had
on American education in recent years. Serious discussion
of the educational scene has been generated by the
descriptive indictments of the schools contained in books
that range broadly from popular and impressionistic works
like those of Herndon (1965), Kozol (1967), and Kohl
(1967), through the personally analytic like Holt's How
Children Fail (1964), to the more scholarly, such as
Jackson's book and Ray Rist's The Urban School (1973).
All these efforts have two things in common. They
describe the schools from an anthropological-sociological
perspective, and they paint an intensely gloomy picture
of school life. No experimental study, in the Fisher
sense, could provide this information.

If educators and the public want to evaluate a
school or some aspect of school life or individual
children's growth and learning, they have to apply the tools
that will give them the most and the best information.
This requires surveying the entire field of social science
to pick out what is appropriate. For any major task
involving evaluation of new activities, it probably also
means inventing new ways of getting the information.

Classification of Evaluation in Education



The term evaluation is applicable to a range of activities
which require judgments to be made. For the purposes of
our discussion, it is possible to organize and discuss
these activities under three headings. Each has its own
problems and its own techniques, but they also involve
similar issues.

First is the general area of evaluating the growth
and development of individual children. This is, of
course, what American schools are about, at least what
they are supposed to be about formally--fostering the
growth and development (the learning) of individual
children. This is also the area in which the experience of
half a century of tests and measurement in experimental
psychology has been most directly applied.

A second level of evaluation concerns judgments
about the performance of various other people in the
school system. Evaluators ask questions about how well
teachers are doing, how well principals are acting, who
should be hired as a school superintendent, etc.
Judgments of this sort are related to, but not identical
with, knowledge about children's growth and development. The
kind of job that is being done by the teaching profession
as a whole, or the kind of service we are getting from
school administrators as a group, is and should be
reflected in the reports we get concerning children's growth
into healthy and competent adults. On an individual level,
this generalization breaks down--there are simply too many
factors involved. It might be possible to draw conclusions
about the type of health service in the United States from
surveys of the general state of health of the population.
It would be much more difficult to make judgments about
the competence of each individual doctor on the basis of
the average or general health of her patients.

Because the measurement of individual children's
results on standardized tests is sometimes the only concrete
evidence that is gathered about the way a teacher is doing
her job, there is a tendency to judge her on the basis of
those results alone. Making such judgments on insufficient
data, and without careful thought, is a rather dangerous
practice. The recent fad for performance contracting, in
which school personnel and programs were judged by the
results as measured by pupil performance, is an example of
this practice. The unhappy events that followed (Report
to Congress, 1974) parallel what occurred in England late
in the 19th century when teachers were paid according to
the examination results of their pupils. In both cases
the system encouraged a surge of improprieties: teachers
and administrators saw to it that a minimum level of
'performance' was guaranteed, no matter how it was obtained.

The third general area of judgments is the evaluation
of programs. At the minimum level, this involves
judgments about a new curriculum, some educational
innovation, or other specific program brought into a school,
such as Title I, Title III, NSF-sponsored science and math
curricula. Much of the recent stress on the importance
and necessity for evaluation is a direct result of the
federal expenditures in education which were directed to
programs of this sort. Here the situation is similar to
the one that prevails in judgments about school personnel.
Decisions are desired concerning total programs; the
methodology that is available is about individual or
average pupil achievement on standardized tests. Attempts to
connect the two by some simple causal relationship may not
be applicable. There are a number of such discrepancies
between federally sponsored programs and standard
evaluations available. Many of these programs have goals that
are quite different or much broader than those encompassed
by standardized tests of achievement developed within
the psychometric models available. Yet, there is often
an attempt to assess the programs in terms of these tests.
To refer again to our medical analogy, it is as if the
success of a variety of community health programs were
all measured on the basis of a set of standard
measurements on individuals concerning their general health:
blood pressure, weight, number of operations, etc. This
might be an appropriate, although not a sufficient,
evaluation for community health programs related to
sanitation or diet, but it would certainly not be the most
useful information for judging a program that had as its
goal the development of mental health centers or family
planning, or addressed other broadly conceived health
issues.

Program evaluation also concerns integral parts of
school organization within a single policy unit. Thus,
for example, a new curriculum or program may be tried out
in several classrooms, or within one school, or in part of
a district. Again, evaluation judgments have to be made,
and again, the results for individual children are part
of, but not all, the information needed to make a
judgment. A complex of social, political, and economic
factors needs to be taken into account. An education program
that, for example, increased reading scores for children
at the expense of their physical health would be highly
unlikely to be acceptable, no matter what the standardized
test results showed. Similarly, any program or innovation
that disrupted the health of the school system
or the functioning of the community would have serious
problems; the judgments about it would reflect this,
regardless of what the intellectual growth and development
of individual children might be.

Questions about general educational policy represent
the most complex level of programmatic evaluation.
These involve decisions about curriculum or educational
goals and how they are determined locally or nationally.
It should be most obvious in this instance that results
obtained on standardized tests of children's academic
achievement are only a fraction of the information
needed for sound judgments. A major function of every
school system is the socialization of the young into the
society. Only a fraction of that socialization is
concerned with academic skills; standardized tests are not
complete measures of academic achievement. Thus, it is
simply not possible to make judgments about the important
functions of the schools on the basis of individual
pupil performances in achievement tests, apart from their
social and cultural context. A similar critical appraisal
of the relationship between standardized testing and
program-related assessments is found in a recent
publication jointly sponsored by the National Institute of
Education and the National Council of Teachers of English
(Venezky, 1974).

The reliance on achievement testing of children to
evaluate a wide range of educational practice is so
remarkable that one has to wonder why sensible people would
even advocate it. Why should a teacher, who has
responsibility for many things besides the academic
achievements of children, be judged only by that achievement
independent of her working conditions, support, local
problems, school system goals, social pressure, and her
ability to inspire or teach or guide or socialize children
as is proper for that community? Why should a school
system, which is charged with keeping children out of
trouble, satisfying a community expectation, providing
recruits for the labor market, training consumers, and a
host of other tasks, be judged only by the achievement on
standardized tests of the children in the system?

It is possible to make some judgments about the
nature of the society and the nature of the role school
systems play from comparative achievement data between
parts of the population. Perhaps the most striking value
of the achievement tests, which are so widely used by the
schools, is that they give solid, 'objective' proof that
the schools support the racism and discrimination that
exists in American society. The one standard measure that
our society uses in judging our children and our education
system shows conclusively that we have created a system
that hurts a large fraction of the population, much of it
black, and most of it poor. The fact that broad categories
of students--urban blacks or poor whites--have systematic
non-normal distribution of test results, on tests
that have been designed to provide normal distributions,
clearly illustrates that our society is treating groups
of children differently and then determining their future
on the basis of this treatment.

What Do We Know About
Children, and Why? 



The appropriate basis for developing any program to
evaluate children's growth and development is to decide what
it is, in fact, a particular audience wants to and needs to
know. Also central to any evaluation decision is the
question: for whom is this information being gathered?
For centuries there was a struggle to free science from
what appeared to be irrelevant, and often stifling,
political and social considerations. Unfortunately, this
struggle, along with the general scientism of the late
18th and 19th centuries, led to the belief that whatever
is studied in a scientific manner is divorced from any
social or political considerations whatsoever. It even
made the question--who wants to know and why?--an
irrelevant one. During a period when science was the
plaything of educated gentlemen, this disregard for social
implications of the uses of science was perhaps possible.
But recent history has made us aware of the social and
political uses to which various forms of scientific
enterprise have been directed. We need to be concerned with
such matters as who is interested in poison gas, or who
wants to know about psychological methods of persuasion, or
why a government agency is collecting data about citizens.

Although the motives and reasons for obtaining
educational evaluation data are usually not as sinister as in
some of the examples I have cited, we also have to ask who
wants particular information about children and why it is
requested. The purposes of an evaluation effort and the
audience for whom it is intended are often a guide to what
is, and is not, appropriate information.



THE NEEDS OF TEACHERS

1. One area of concern for teachers is whether children
are learning direct, specific skills: sounds of letters,
mathematical operations, rules of kickball, or how to
look up the spelling of a word in a dictionary. This kind
of information, in many cases, can be easily determined by
using standardized tests. It is quite possible to give a
child a test that will determine if she can read the word
"ball" or give the correct answer to the question
3 + 3 = ? But in most cases, a teacher can also obtain
the same information quite easily in other ways in the
course of day-to-day contact with the children. Any
teacher who has children read to her will get a
reasonably good idea what words a child knows. Any teacher who
plays a board game with children that uses dice, for
example, will find out about a child's ability to add
numbers up to 6 + 6.

In discussing the use of reading tests, Venezky
states:

The number of different instructional groups into
which students are placed is generally small, and
the differences in predictive ability of even the
most extensive formal tests over informal teacher
judgment have never been shown to be large. (p. 7)

2. Information is needed about children's more
fundamental growth through stages of development. Increasing
vocabulary or learning more 'number facts' does not
constitute advances in the kind of thinking the child can
engage in. On the whole, it is not possible, using simple
questions with multiple choice answers, or true and false,
or fill in the blank, to obtain information about how an
answer was arrived at, the reasoning process that was used
to arrive at an answer, or the levels of complexity that
are involved. For example, it is fairly easy to devise
a method to determine how long a number anyone can
remember. You simply ask the person to repeat a number back
to you, starting with a one digit number, then a two
digit number, and so on. But if you want to determine a
person's reasoning process, the task becomes much harder,
if it is possible at all. As problems become more
complex and more interesting, the ways to attack them also
increase in complexity and in number so that no matter
how carefully you structure the problem or its parts, there
is simply no way--looking only at the answers--to find out
how a person arrived at the various responses to complex
questions. Any experienced test taker knows the strategy
which argues that a particular answer must be correct (or
incorrect) because it is the kind of answer that would be
expected by that particular test or tester, or because the
answers on this type of test are bound to be whole numbers,
or because there wouldn't be two similar answers, or--a
strategy that a friend of mine swore she used with great
success--"in all multiple choice exams, if one answer is
significantly longer than the others, it is always the
right answer, because no one would bother to make up a
long wrong answer."

3. Horizontal Growth: Related to the question of
developmental growth--how children think, how complexly
they can approach a problem--is the issue of horizontal
growth discussed earlier. How rich is the experiential
base and how rich is the thinking on any one level? This
is such a subtle and under-explored area of development
that there are obviously no simple ways to get at ques-
tions about it.

4. Another area of concern for teachers is how well
children can use the skills they have. There is no simple
correlation between mastery of vocabulary and syntax,
and actual reading done, or even reading with comprehen-
sion. This question of use of skills requires both the
skill itself and some sort of context in which to use it.
For reading, it requires actual reading; for math, quanti-
tative manipulations; for art, the production of something
expressive; for crafts, construction of objects; for
sports, participation in the activity. It is not clear
that situations that are specifically designed to test
the use of a skill outside the context of actually doing
something have much relationship to that skill. It is
certainly not enough to look at the results of tests to
find out whether children actually use certain skills.

5. Finally, there is the question of 'learning to
learn', or learning problem-solving, heuristics, or any
of a number of terms that have been applied. Increas-
ingly, educators are becoming aware that education should
strive to develop in people the ability to take care of
themselves, to undertake their own continuing growth and
development, to deal effectively with situations not
previously encountered.

Information that is useful to teachers is direct,
immediate, and specific about the children in this year's
class. From the viewpoint of an elementary school tea-
cher, the disadvantages of national standardized tests
far outweigh their advantages. Of the five areas discus-
sed above, only the first is covered in these testing
programs. But even this information is given in norma-
tive terms, comparing a child to some national sample,
rather than in individual terms, helpful in working with
that particular child. Also, it takes an enormous amount
of time to get back the results. At the same time, the
tests disrupt the educational work in the classroom, de-
moralize and disturb the children, and disrupt the help-
ing relationships established among the children and be-
tween the children and the teacher.

THE NEEDS OF SCHOOL ADMINISTRATORS

The situation changes when we look at the kind of infor-
mation that school administrators find useful. It might
be hoped that school principals, as part of the task of
supporting teachers and being concerned about the educa-
tion of children, would be interested in the same sort
of things that teachers need to know about children.
Actually, most American school principals are not head
teachers, but organizational administrators; they are con-
cerned with staffing, busing, and discipline. They don't
have the time, training, or, for the most part, the in-
clination to teach and to be involved in the growth of
individual children. Since principals, and the rest of
the administrative ladder of a school system, see them-
selves as supervisors of a system rather than guides for
individual children, they need system-style data: short,
concise, and easily compared. Also they are concerned
with trends: How does performance or learning, or any
other measure, compare year-to-year? How does it relate
to expenditure or to changes in practice, etc.?

Concern for trends and comparative data is neces-
sary within any school system. It is necessary to know
as precisely as possible how a particular practice affects
results; that is what evaluation is all about. Unfortuna-
tely, for the reasons indicated, the information obtained
from standardized testing is simply inadequate for many
of the decisions for which it is used, or at least for
which its use is claimed.



THE NEEDS OF PARENTS

Parents are also interested in information about the
school. On the one hand, parents want information about
how their own children are progressing in school, what
they do, how they behave away from the home setting. On
the other hand, as community members and tax payers, they
want comparative information and information on trends in
the schools-at-large. In situations where they have be-
come involved, parent concerns usually go far beyond the
limits of what is provided by standardized tests.

Parents want to know what their children's chances
of success in life might be. It is often argued that one
of the reasons school systems need standardized testing
is to give parents this information--if teachers didn't
have these scores, parents would not have any idea of
what their children's education was worth and could
do for their children in the long run. Pressure from
parents is often said to be the reason reading scores are
stressed, because parents believe that reading scores are
related to future success, to getting into college, etc.
It is ironic that school personnel should point to par-
ents as the force that supports the tests, because it was
the school administrators and academic experimenters who
originally sold the tests to the parents. The great trend
towards quantified statements of school performance came
not from parent and community groups but from the scien-
tism of the academy in the early years of this century
(Cremin, 1961).

The problem with the belief that individual high
test scores lead to success in society is that it intro-
duces the lottery concept into education. The possibility
of high test scores is held out to low-income parents as
a way to provide a great future for their children when,
in fact, it would take a very high score indeed to change
significantly the life chances of poor children. The re-
lation of school to college admission, jobs and income
is a complex one closely related to the prejudices and
discriminations in our society (Berg, 1970). It is
true that an unusually 'high-achieving' child from de-
prived circumstances--that is, a child who does very well
on the standardized tests--can break out of the bounds of
the economic and social class in which she lives, and ac-
tually change her status. But the odds against this are
enormous. This kind of case--but there are some all the
time--has the same effect on redistribution of classes in
society that the lottery has on redistribution of income.
The lottery in Massachusetts, for example, provides about
a 13,000,000-to-one chance against winning $1 million.
That means that after 13 million tickets are sold, one
person may significantly change her economic status.
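
The lottery comparison rests on simple arithmetic, which can be sketched in a few lines of Python. This is only an illustration: it assumes the quoted odds can be read as roughly one winning ticket in 13,000,000 and that tickets win independently, and the function names are made up for this sketch.

```python
# Odds quoted in the text, read here as "1 winning ticket in 13,000,000"
# (an approximation made for this sketch).
ODDS_AGAINST = 13_000_000


def win_probability(odds_against: int = ODDS_AGAINST) -> float:
    """Chance that a single ticket wins the top prize."""
    return 1.0 / odds_against


def expected_winners(tickets_sold: int, odds_against: int = ODDS_AGAINST) -> float:
    """Average number of top-prize winners among the tickets sold."""
    return tickets_sold * win_probability(odds_against)


def chance_of_any_winner(tickets_sold: int, odds_against: int = ODDS_AGAINST) -> float:
    """Probability that at least one ticket wins, assuming independent tickets."""
    p = win_probability(odds_against)
    return 1.0 - (1.0 - p) ** tickets_sold


winners = expected_winners(13_000_000)
any_winner = chance_of_any_winner(13_000_000)
```

Under these assumptions, 13 million tickets yield about one winner on average, and even then there is a substantial chance that no ticket wins at all.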

There are just enough winners of smaller amounts so
that many people can support the illusion that they too
may be a winner, that they too can change their status.
But, of course, the actual number of people who do win
something is so small as to be insignificant for any
change in class alignment. Exactly the same reasoning
holds for the concept that good reading scores will help
populations break out of poverty or oppression. The ac-
tual number of children who can change their status as a
result of school success is trivial compared to the to-
tal population that is condemned to poor jobs and con-
tinuing poverty.

An analysis of the kind of information different
groups need leads to the conclusion that the present sys-
tem of reporting children's standardized achievement
scores, at best, only assists school administrators, and
only in one of their functions: that of acquiring data
for long-term planning and assessment. Even in this area
the results available are one small component of the in-
formation that is needed for intelligent decision making.

The Measurement of Child Achievement



Because the measurement of children's achievement repre-
sents the most widespread standard practice in the schools,
because it is often the basis for many other judgments,
and because it is at the primary level of the evaluation
hierarchy, I would like to examine it in more detail.

STANDARDIZED ACHIEVEMENT TESTS

One of the most remarkable features of the present state
of the measurement of child growth and development in the
schools is that while all thoughtful educators agree that
the available tests are terrible, almost everyone con-
tinues to use them. Meeting in Washington in 1972, edu-
cational sponsors of Follow Through programs agreed unan-
imously that the available tests were inadequate to mea-
sure what was happening to the children in Follow Through
classrooms. Strong criticism was expressed not only by
the sponsors who advocated more "open" programs with em-
phasis on varied learning style, affective development,
and social concerns, but also by the sponsors who advoca-
ted more traditional programs. Many of the sponsors were
highly critical of Stanford Research Institute, the or-
ganization hired by USOE to conduct the official overall
evaluation, for not developing more imaginative and use-
ful measures of child growth and development! Yet, for a
number of reasons, including social and political ones,
many of these sponsors actually used identical tests in
their own evaluations!

The usual generalized argument given in support of
continuing to use recognizably inadequate tests is that
there is nothing better available. This argument is a
sound one if an activity or a process is simply not as
good as it could be--that it is inadequate. But criti-
cism of standardized achievement tests goes much further
than that: the tests are not only not good enough, they
are harmful and destructive to a number of school pro-
grams; they are especially harmful to children.

A number of specific points, some of which I've
touched on in passing, can be made in this regard:

1. The present tests are discriminatory. They
have a strong socio-economic class and sex bias, and they
favor middle-class society and norms at the expense of
poor children and children from cultures differing from
the majority, middle-class, Anglo-Saxon culture of the
United States. A blatant example of the sex discrimina-
tion in standardized achievement tests is illustrated by
a question in a primary level MAT which shows an outline
drawing of a man in a long coat, with a small mirror at-
tached to his forehead, examining a child. The correct
answer for the word that describes the person pictured
must be chosen from four choices that include both "doctor"
and "nurse." This not only encourages the stereotype that
doctors are usually men, it penalizes a child who knows
that male nurses exist.

The discriminatory nature of the tests towards
other cultures is evident on inspection. On the whole,
the tests show white, middle-class children performing
stereotyped activities which can be recognized by con-
ventional symbols and the language used to describe them.
Strong evidence for the confusing and limited nature of
the tests is found in a pamphlet by Deborah Meier (1973),
a teacher in New York City who discussed the tests with
children in school. She found that the children were con-
fused by the questions, by the unfamiliar language used,
and by the situations depicted which were not appropriate
to their experiences in life. For example, a question on
a primary MAT shows a smiling girl carrying some books in
the rain. The correct answer, to be chosen from one of
the three sentences that describes the picture, is
"Mary's books will get wet in the rain." But, the New
York City children argued, this could not be the right
answer. She would not be smiling if her books were going
to get wet; so they chose things like "the rain will not
hurt the books" or "Mary is taking good care of her
books," the two other possibilities. Many examples can
be chosen, the point being that a significant number of
items on nationally used standardized tests are confu-
sing, and choosing the correct answer depends not on read-
ing ability alone (which the tests are supposed to mea-
sure), but on knowledge and acceptance of cultural norms.

2. Standardized tests simply are inappropriate for
whole categories of educational settings. The very nature
of the tests, the way they are given, the way they are
graded, and the way their results are used is antithetical
to more cooperative open styles of education. This point
has been carefully made by Margaret deRivera (1973). She
lists a series of ways in which the test situation itself
is incompatible with open education practice:

1. Open classroom: children are encouraged or at
least allowed to share, to converse, to help one
another.
Testing situation: no talking, no sharing, no
helping one another.

2. Open classroom: children exercise and demon-
strate their knowledge and skills in many differ-
ent modes: verbally, by action, dramatics, wri-
ting, etc.
Testing situation: the children's response mode
is limited to reading, listening, and marking.
Knowledge and skills which they are used to
exercising in one mode have to be translated to
the mode of response that fits the test.

3. Open classroom: generally flexibility is such
that children can finish most tasks they begin and
can go on to something else when finished. Chil-
dren can move around the room.
Testing situation: no moving on to the next task
when finished, often not enough time to finish a
task. Children must remain seated at a desk.

4. Open classroom: children generally work at
many different tasks, so that comparisons are
not easy and competition is not encouraged.
Testing situation: children work on the same
task at the same time so that comparisons are
facilitated.

5. Open classroom: each child is viewed as a
complex, unique individual, having strengths and
weaknesses but essentially qualitatively differ-
ent from others.
Testing situation: quantitative differences be-
tween children are important, qualitative differ-
ences are lost. Success is defined by others'
failures. (The 60th percentile means that 60
percent of the children in that grade score be-
low.)

6. Open classroom: the child is given learning
experiences designed to develop a self-image of
a competent, effective, successful person. This
is considered an important attitude for effective
learning.
Testing situation: the very children (those who
are weakest in skills) who need the support of a
positive self-image in order to continue learning,
are discouraged and frustrated by failure.

7. Open classroom: thoughtful, critical thinking
is encouraged.
Testing situation: often random guessing is a
more successful strategy than thoughtfulness since
the tests are limited in time. Thoughtfulness is
not rewarded.

8. Open classroom: intrinsic motivation (i.e.
learning for learning's sake) is considered the
most effective motivation for long-term learning.
Testing situation: extrinsic motivation (i.e.
learning for some outside reward) is encouraged;
learning in order to pass the test.
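
The parenthetical note on percentiles in item 5 can be made concrete with a short Python sketch. The scores below are hypothetical, and the "percent strictly below" convention is just one common way of computing a percentile rank, assumed here for illustration. The point of the example is that a child's rank can fall while her own raw score stays exactly the same.

```python
def percentile_rank(score, group_scores):
    """Percent of scores in the comparison group that fall strictly below `score`."""
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)


# Hypothetical raw scores for a ten-child comparison group (not real data).
group = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
rank_now = percentile_rank(70, group)

# The same raw score of 70, after four other children raise their scores.
stronger_group = [10, 20, 75, 80, 85, 90, 70, 80, 90, 100]
rank_later = percentile_rank(70, stronger_group)
```

In this made-up group the child with a raw score of 70 sits at the 60th percentile at first and at the 20th afterwards, although nothing about her own performance has changed.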

3. Some serious questions are inherent in the method-
ology used to prepare standardized tests. I have already
discussed some of the general implications of the experi-
mental methods on which standardized tests are based, but
there are even more detailed problems associated with
them. The standardized tests in use in the United States
today are prepared in such a way that they are 'valid',
that they will give a 'normal' distribution of results,
and that they represent the most common curricula in use.
Each of these concepts has serious problems. By 'vali-
dity' the test makers mean that the results on the stan-
dardized test have been correlated with results from
some other measurement. But, in fact, reading tests are
not correlated to some independent measure of the ability
to read: the correlation that is generally used is only
to other grades or tests in school. They are correlated
to other paper and pencil tests, usually of the intelli-
gence or achievement kind.
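
To show what "correlated with some other measurement" amounts to in practice, here is a minimal Python sketch of a Pearson correlation coefficient between two sets of test scores. The scores are invented for illustration and the variable names are hypothetical; nothing here is taken from an actual test manual.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six children on two paper and pencil tests.
reading_test = [12, 15, 19, 22, 26, 30]
other_paper_test = [36, 40, 50, 54, 64, 70]

r = pearson_r(reading_test, other_paper_test)
```

A coefficient close to 1 here says only that the two paper and pencil tests rank these children in nearly the same order. It says nothing about whether either test agrees with an independent measure of the ability to read, which is the objection raised above.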

The tests are also constructed to show a 'normal'
distribution of children, one smooth curve with not too
many spread out at the bottom and not too many spread out
at the top, and most of the population distributed around
some average value. Two main arguments are used to jus-
tify this procedure. First, it is argued that this is
generally the way attributes distribute themselves in
any large experimental population: if you measure the
height of many children of one age, you will find a
'normal' distribution, with a large number of children
near one particular measurement (on both sides of it),
and the rest of the population trailing off to much
greater or lesser heights. Whether this holds for the
entire population in such developmental activities as
reading is not known and there is really no way to find
out.

There is something quite arbitrary in the notion
that at every age and every developmental level, no mat-
ter what property is tested, the results will distribute
evenly along a normal distribution curve; that is, some
people cannot do it, some can do it quite well, and the
majority does it adequately. Certainly, if a number of
18-month to two-year-old children were tested to see how
many steps they could walk in a fairly straight line, the
population would distribute itself more or less normally,
with some children not being able to walk at all, and
most of them only able to manage a small number of steps.
(Of course, even here the distribution would not be nor-
mal, because a few children might walk so well that the
measurement of individual steps would be almost silly.)
But a test of ability to walk at age six should yield
something quite different from a normal distribution.
First of all, we would expect all children, except a small
fraction of handicapped children, to be able to do the ac-
tivity. Then, to set up a walking test for six-year-old
children that would result in a normal distribution would
mean, first, a strange definition of "walking," and se-
condly, deliberately devising test items (such as walking
on your hands, or running fast or doing complicated dance
steps) so that the nature of the test would force a nor-
mal distribution of the results. This is precisely the
situation with reading tests. They are constructed at
every level from pre-kindergarten to high school so that
the population that is tested will distribute around
some norm.
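
The walking example can be sketched with a toy calculation in Python. Every number below is hypothetical (a group of 100 six-year-olds and invented counts of how many can do each task), and the nesting rule, under which any child who can do a harder task can also do the easier ones, is a simplifying assumption made only for this illustration.

```python
from collections import Counter

N_CHILDREN = 100  # hypothetical group of six-year-olds

# Hypothetical number of children able to do each task (not data from the text).
ABLE = {"walk": 97, "run fast": 60, "dance steps": 30, "walk on hands": 5}


def score_distribution(tasks):
    """Count how many children earn each total score on the given tasks.

    Child i is assumed able to do a task when i < ABLE[task], so ability is
    nested: anyone who can walk on hands can also do the easier tasks.
    """
    scores = [sum(1 for t in tasks if i < ABLE[t]) for i in range(N_CHILDREN)]
    return dict(sorted(Counter(scores).items()))


walking_only = score_distribution(["walk"])   # a plain test of walking
padded_test = score_distribution(list(ABLE))  # walking plus the harder items
```

With these invented counts, the plain walking test puts 97 children at the top score and 3 at zero, while the padded test spreads the same children over five score levels with most of them in the middle. The spread comes from the choice of items, not from any difference in the children's ability to walk.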

The second justification that test makers give for
using normal distributions is that the statistics and the
methodology for such distributions are well known, and
easy to work out. But even normal distributions, if that
is what the population shows, can have variations. The
horizontal shape of the curve is important: do most
people cluster around a mean, with only a small fraction
of the population trailing off at the extremes, or is
there a very wide spread of results with only a slight
cluster around the mean? Tests are constructed with some
spread determined that will make the grades and scores
easy to handle: not too much spread and not too little.
This characteristic is particularly important when the
test is given to a population of students who generally
either don't do very well on the test or do extremely
well: the standardized tests tell you mainly that you
can't say very much about these children from that parti-
cular test. But, of course, in education it is precisely
the children who are far from the average about whom we
need the information.

There are also some questions about the standardi-
zation methods of the tests. The problem of finding a
test population of children to standardize test items is
really quite serious. The 1958 version of the MAT was re-
ported to be standardized against a sample that greatly
over-represented southern and rural school districts at
the expense of northern and urban districts (Hunter and
Rogers, 1967). In order to evaluate a test item, someone
or some group of people must go into schools, find thou-
sands of children, give them the sample test, and see
what fraction of the children get the correct answer.
Now anyone who has worked in schools knows that gaining
entry to classrooms to do any sort of research or study
is not a random process. It involves a certain amount of
political work, getting to know school system people, and
choosing school systems and individuals who are coopera-
tive. School officials, quite reasonably, want to know
where strangers go and what they do. So the work that
must be carried out to standardize a test already raises
questions about the nature of the sample.

Further, to obtain an appropriate body of questions,
the test makers not only average and manipulate the dif-
ficulty of the questions, they also design the content so
that it will reflect the most widely used curricula. And,
since shrewd publishers develop curricula with an eye to
matching the tests, that is, contain the most-used words,
etc., a vicious cycle ensues in which the tests and cur-
ricula (developed by the same groups) justify each other,
while having little relation to the lives and achieve-
ments of children. Any examination of tests will reveal
that the vocabulary, style, and material content are very
much school-oriented, and not life-oriented. They certain-
ly do not contain any vocabulary or structure correspon-
ding to black English, as described by Labov and others
(Labov, 1972). But neither do they really contain the
language of any children. The test words and stories are
a bland melange of the dull fare found in school readers.
One even looks in vain for evidence of the newer curricu-
la that have been introduced into the schools. It is
widely assumed, for example, that the 'new math' has ta-
ken over the schools, that set theory, other-than-base-
ten systems, and various mathematical definitions have be-
come important. This certainly doesn't show up in the
tests. As members of Educational Developmental Center's
(EDC) Project One have shown in a recent analysis of the
math tests, 50 to 70 percent of the questions deal with
simple computation in the base ten system and the rest of
the material is heavily directed towards simple defini-
tions. The few questions that deal with modern mathema-
tical concepts are often ambiguous or misleading, and
sometimes just wrong.

4. Hierarchy of Knowledge. The last three points
all deal with consequences of assumptions inherent in the
development of the tests, rather than with their general
characteristics. The process of constructing test items
--definitions, problems, words--proceeds under the as-
sumption that there is a clear hierarchy of knowledge:
that some things are harder than others, that some ac-
tivities are, and should be, learned later than others,
that the kinds of problems that children can solve or the
kinds of material they can read can be strictly graded
and categorized from simple to complex. This assumption
runs counter to several important principles of learning
theory supported by open education practitioners. I have
already discussed these individual differences: learning
styles, horizontal growth, and individual rates of dev-
elopment.

5. Standardized tests used in the United States
today are exclusively paper and pencil tests which mea-
sure nothing but simple reading skills, the naming of
concepts or objects, and computation skills. Despite the
titles to the sections of the tests, very little else is
measured. Most reading tests have a section entitled
"Comprehension." But one way to answer the questions is
not to read a paragraph and comprehend it, but simply to
skim the paragraph, look at the questions, and then find
the salient information. The tests certainly do not mea-
sure the comprehension of ideas; at most, they may deter-
mine whether the person taking the test knows the mean-
ing of a word. The math sections have such titles as
"Concepts" or "Problem Solving," but the concepts usually
are definitions or names and the problem solving is more
often a reading problem than anything else.

6. The fact is that the standardized tests that
are given are just plain bad. They are not even good
tests by their own standards. For example, the Primary
Form F of the MAT shows the children a math problem with
the following figure:

[Figure: a line segment with labeled points, including j, k, and n]

A child is asked to "Look at the line segment at the top
of the box. Fill in the space next to the statement
which is true."

The statements are: k is greater than n
                    m is less than j
                    j is equal to k
                    j is less than k

dk 

It doesn't take too much mathematical knowledge to know
that you cannot define a line segment by one point. The
statements are meaningless.

I have picked this one example because it is not
just a case of a question being vague, ambiguous, or
misleading. The question is simply impossible to answer
at all. It may seem like a small matter that one out of
a set of 40 questions in a test which has a total of 114
items is incorrect, but in fact the consequences of an
impossible question are quite significant; one question
can make a surprisingly large difference in a grade equi-
valent score. But a more important point is that these
incorrect questions, as well as many more that are ambi-
guous and strange, appear on the tests at all.

7. Probably the most specious argument made in
support of standardized tests is that evaluation is too
important an activity to be left to individual teachers
and schools, and to the dangers of a great variety of
standards and a good deal of sloppy measurement. Every-
one knows how hard it is to make up good exam questions,
the argument goes, so better leave the process to the
experts who can test out the questions on large sample
populations and ponder them carefully.

But the experts seem to come up with grossly inade-
quate measures. My first contact with the world of stan-
dardized testing was as a chemistry teacher in a private
high school. I had a very bright, small class and we
worked hard. Many of the students were the children of
Caltech faculty; they were interested in science and had
good training. At the end of the year, I gave them a
standardized examination prepared by the ubiquitous ETS,
organized especially for independent schools. (As far
as I know this test is still being given.) But the test,
I found, contained some questions that were simply incor-
rect--a drawing of a laboratory experiment showed a to-
tally unsafe situation which might blow up at any moment--
and some that were simply irrelevant: What is the Solvay
process? The latter was, in fact, an industrial process
already becoming obsolete at that time. In my youthful
enthusiasm and anger, I showed the test to a number of
faculty members--prestigious chemists, members of the
National Academy of Sciences, and leaders in their field.
They all agreed that the test was stupid, wrong, ambigu-
ous, and inappropriate for a reasonable chemical educa-
tion. Yet when I wrote to ETS about it, I got the same
answers that the supporters of tests still give: They
also had consulted experts who saw nothing wrong with the
test, they had gone to a good deal of trouble to standar-
dize the test questions, and they simply couldn't go
about changing them.


WHY ARE THE TESTS SO BAD? 

But the major concern about the tests and their influ-
ence does not depend on the particular criticisms that can
be levelled against them. The tests fail by the very
standards of the experimental paradigm within which they
are made; that is, they are poor tests with ambiguous and
incorrect questions. What is of greater concern is the
way the tests succeed within the wider framework in which
they are used: namely, they are one more component in the
sorting system of American schools. They contribute one
element (although not the only one), one necessary condi-
tion (although not a sufficient one) to see to it that
the schools continue the society as it is. Society uses
schools to sort out and classify, to reward those who
come from the middle-class and keep down those who are
already poor; and the tests help in this major social
effort. They couldn't do it alone, they simply contri-
bute. And as long as they do that job, which happens to
be independent of the specific test items, unrelated to
whether or not there are ambiguous questions, they can
continue to be used and used effectively (Karier, 197?).

The research paradigm within which the tests are
constructed is actually very good for determining major
trends, making gross distinctions: distinguishing between
those who can read in general and those who cannot, be-
tween those who can compute reasonably and those who
really struggle with numbers. This sort of distinction is
easy enough to make, and since the test design is good
enough to determine these gross differences it doesn't
really matter too much if a few questions are ambiguous.
Actually the ambiguous questions also serve an important
function: they make the tests better at the kind of clas-
sifying for which they are used. The tests don't do very
well at describing individual styles, levels of achieve-
ment, or usable knowledge, but they do test the ability
to follow instructions, to not think too deeply (that's
one way to avoid the ambiguities in many questions), and
to do reasonably neat clerical work at a steady pace with-
out thinking about it too much.

One measure of the extent to which the tests don't
accurately reflect the abilities and knowledge of indivi-
duals is the number of exceptions to expected results.
Every person active in education has her own store of an-
ecdotes about Jane who did poorly on an MAT, but could do
the work; of Frankie who could read only on the second
grade level, but after two months of help could read on
the sixth grade level; of Janice whose IQ rose 25 points
in a year. In some cases, where people have looked care-
fully at children and worked sensitively with them, whole
classes and groups have made phenomenal increases in their
IQ scores or their grade level achievement over relatively
short periods of time. In Reading, How To (1973), Kohl
reports the case of Lillian, a child whose performance
improved so much that it required the threat of a law
suit to force the school to accept the results of three
reading tests. This phenomenon is further documented by
a report from the Far West Laboratory for Educational Re-
search and Development (Rayder and Nimnicht, 1973) con-
cerning some of the results in their Follow Through pro-
gram classes. The authors demonstrated that the children
in classes in 14 school systems across the country in-
creased their average IQ scores on the Wechsler test of
intelligence by significant amounts over a three-year
period of the program. They went from scores that were
much below the average for the country to scores that
were above that norm. The authors concluded:

First, intelligence tests are not reliable measures of the
abilities of these children; second, the problem of
cumulative deficits is with the school, not the child.

In other words, standardized tests are one link in a long
process that tells poor children, and especially poor black
children, that they are on the bottom of the heap and should
stay there. That is why American schools continue to use
tests which are inadequate even by their own stated goals,
and which have become one of the principal instruments
through which schools serve to maintain social and economic
inequality.
Evaluation Alternatives



REFORM OF STANDARDIZED TESTS 

One approach to "the major disaster area in education," as
evaluation was recently called by James B. Macdonald (1974),
would be to improve the standardized tests. From the
foregoing criticism, it is obvious there is room for a great
deal of improvement. The questions could be better, the
standardization could be more representative, and the
validation against criteria more appropriate than the ones
that are used. More imaginative use of the available
technology could vastly improve even paper and pencil,
machine-graded examinations. If it is accepted that there is
more than one way of doing a problem, why not present the
alternative ways on the test and grade anyone 'correct' who
simply solves the problem, whichever way he or she does it?
The whole notion that the scoring and administration of the
MAT is done on a basis of total correct answers in each area
without any further modification is really quite absurd. Why
not a choice of questions, or questions which relate to a
wider range of skill, or the possibility of more than one
correct answer in some cases? Moreover, is there any reason
at all to limit the concept of standardized achievement to
paper and pencil tests? Why not standardize a much broader
range of activities if this were desired?

Unfortunately, any effort to reform the tests has two major
drawbacks. First, it ignores the analysis of why the tests
are so bad now. To assume that achieving better standardized
tests is simply a matter of making changes in the tests
themselves is, I believe, to be naive about the education
world and about American society. It is highly unlikely that
all the people who put the tests together, suggest the
questions, write the language, try them out on children,
standardize them, and finally publish and sell them are all
totally unperceptive and uneducated. The tests and their use
are deeply embedded in the fabric of American society and
must be rejected on political grounds, not modified at the
technical level.

Secondly, any proposal for a major effort to produce new
testing mechanisms is reminiscent of the program that was
launched almost 20 years ago to produce new science and math
curricula. Scientists and mathematicians who turned their
attention to schools were horrified at the state of the
situation: the curriculum was simply bad, they said, full of
error, wrong concepts, incorrect statements, too much stress
on rote learning, simple drill, etc. They set out to reform
education by updating and correcting the curriculum, to make
it 'better'. One of the major learning experiences for those
involved in that curriculum reform was that new curricula,
although a necessary condition for better school experiences
for children, were hardly a sufficient change. In fact, much
of the new curriculum was neatly fitted into existing school
structures (indeed, it was designed for this) and instead of
the curricula changing the schools, the schools absorbed the
new curricula without much modification in the essence of
the schooling provided for most children. In many ways the
new curricula were simply ignored. While the rhetoric of the
New Math has had wide acceptance in the schools it would be
hard to know it from many of the day-to-day activities in
the classrooms, and difficult to discern it on the items
which appear on the standardized tests (Sarason, 1971).

To try to 'correct' or save education by simply outfitting
the schools with better testing procedures is inadequate as
a strategy. As in so much else, parts cannot easily be
separated from the whole. To bring about fundamental change
in the schools, the entire program must be reexamined:
curriculum, evaluation, teaching style, views of learning
and knowledge, etc.

There is obviously some merit in developing a more
reasonable and wider-ranging approach to standardized
testing, as long as one neither expects the task to be
simple, nor hopes to change education by this alone. The
area of developing alternative tests is a wide-open field;
remarkably little work has been done in it because the
standardized achievement tests and their companions, the
widely used intelligence tests, so dominate the field that
little else has been tried and certainly little else has
been carried very far. An appropriate analogy can be made
with the automobile industry. At one point, in the early
development of automobiles in the United States, a wide
range of designs and approaches to the problem of mechanical
energy-driven vehicles were explored: different engines
(electric and steam) as well as other fossil fuels (such as
diesel fuel) competed with the high-octane gasoline model.
But the gasoline-powered internal combustion engine was so
successful, it spread so widely over the market, that many
other technologies were simply not followed up very much.
Today we know a great deal about the gasoline engine that
uses rather a lot of gasoline, and very little about the
alternatives. Its commercial success and relatively low cost
(which was related to that success), along with the low
value placed on the various problems it represented (that
is, as long as there was no gas shortage), simply made it
unnecessary to do other work.

There is, however, another component of this analogy which
is not quite so innocent. Along with developing its
technology, the automobile industry evolved policies that
channeled and directed research, labor, and expenditures in
the direction of private automobile travel and away from
mass transit. Decisions that had profound effect on our
society served to benefit a particular sector of private
industry, namely the sponsor of those decisions. As the
Boston Globe observed in commenting about a recent Senate
subcommittee report:

GM, Ford, and Chrysler reshaped American ground
transportation to serve corporate wants instead of social
needs. This study suggests that a monopoly in ground vehicle
production has led inevitably to a breakdown on the nation's
ground transportation.

The report further documents how, beginning in the 1920s,
General Motors began to buy up rail and electric urban
transportation systems and then replaced them with buses or
diesel locomotives, which it manufactured (March 1974).

The same report, the Globe reported on March 3, 1974, also
documents that changes in styling in the automobile industry
through the years were not necessarily related to
improvements in technology (Rothschild, 1973).

It may well be questioned whether there are similar
interests involved in the continuing use of large-scale
standardized testing programs in our urban centers. The
companies that produce standardized tests are analogous to
the big three automobile manufacturers: they dominate their
market and dictate what is and isn't profitable, but their
outlook is limited by what they have found successful.
Commercial self-interest makes them unwilling and unlikely
to speculate on different projects that would undercut their
own positions. And, like the big three automobile
manufacturers, the publishers who produce testing programs
are not isolated from the rest of society. They have
connections in schools of education, foundations, and
government that work together to maintain the status quo,
just as the automobile industry has connections in research
institutes, regulatory agencies, and government.

One strong argument continually made for maintaining the
present evaluation system is the cost involved. It is simply
a great deal cheaper to give the MAT to every child in the
school system than it would be to introduce any of the
alternatives that have been suggested. It is undeniably
correct that it is much cheaper in dollars and cents for any
particular school system in 1974 to buy MAT booklets for
every child and give these tests than to establish some sort
of individual observation system to determine the status of
each child. But the total expenses are so different that
they cannot be compared, because it is a little like
comparing the cost of gas for your kitchen stove and the
cost of installing a nuclear-powered technique for preparing
food. A kitchen that already has a gas stove will also have
appropriate cooking utensils, a line leading in for the gas,
and stores nearby which sell food that can be easily
prepared by gas stoves in a short time. To compare the real
costs of two totally different approaches to food
preparation, one would have to take into account the
investment that has been made in all these things and the
development costs that went into setting up a food
distribution network to cater to that style.



The cost of feeding the present testing machine is quite
small in comparison to setting up another one, but that does
not mean the total investment in it is small. In fact,
school systems spend a great deal of money on testing and
evaluating children. Besides the cost of the millions of
test booklets, which are not reusable, there are a number of
personnel in the school system, especially the city systems,
but smaller ones as well, whose job is to give the tests,
organize the test-taking, etc. Teachers and children spend a
good deal of time giving and taking tests. In some Follow
Through sites, as many as six weeks of the spring term were
totally lost while the classes went through the agony of
taking the various required tests dictated by the city, the
program, etc. The whole experience simply disrupted all
instructional activities for a month and a half (that is
about 18 percent of the total school year). Nor do the above
costs include the human and social factors: how the tests
affect programs, how they tyrannize teachers and demoralize
students. Also not included is the incredible inefficiency
of testing. Typically, children are tested sometime in the
fall and spring and the comparative results are released
very late in that year or, often, in the next year. Teachers
cannot even use the tests for their own teaching purposes;
they can only be used as a weapon by outsiders, after the
children have moved on to the next grade.



ALTERNATIVE STRATEGIES FOR MEASURING CHILDREN'S LEARNING

In terms that have been made familiar by Thomas Kuhn (1970),
there is always a prevalent paradigm in any scientific
activity (perhaps in any human activity) within which a
majority of the work is carried out. But there is usually a
small minority of work going on outside it, and the major
breakthroughs in science occur when a new paradigm replaces
an old one. Likewise, in evaluation work, the vast majority
of activity falls within the accepted
experimental-psychology-research paradigm, but there has
been a small ongoing tradition of work outside that
paradigm, and open educators are waiting hopefully for the
overthrow which will allow a breakthrough in our views on
evaluation. There are indications that evaluation
alternatives are becoming more popular (Eisner, 1972;
Parlett and Hamilton, 1972, etc.).

An older American evaluation effort (Aikin, 1942) is worth
discussing briefly because it transcends the paradigm. In
1932, the Progressive Education Association launched a major
effort to determine what, if any, influence

around the country were followed throughout high school and
college for a total of eight years. The evaluation activity
involved a number of factors besides standardized measures
on students. The staffs of participating high schools were
particularly concerned about their programs during this time
and used the fact that they were part of the study to
examine and modify their activities, and the colleges
involved agreed to waive admission standards for the
students involved. The study led to meetings between the
cooperating schools and colleges, and it stimulated
curriculum changes in both.

The actual evaluation work included questionnaires, records,
unobtrusive measures, interviews, etc. The best description
of the evaluation/education activity can be obtained from
quoting the summary of their neglected five-volume work:

In the comparison of the 1,475 matched pairs, the college
Follow-up staff found that the graduates of the Thirty
Schools

1. earned a slightly higher total grade average;

2. earned higher grade averages in all subject fields
   except foreign languages;

3. specialized in the same academic fields as did the
   comparison students;

4. did not differ from the comparison group in the number
   of times they were placed on probation;

5. received slightly more academic honors in each year;

6. were more often judged to possess a high degree of
   intellectual curiosity and drive;

7. were more often judged to be precise, systematic, and
   objective in their thinking;

8. were more often judged to have developed clear or
   well-formulated ideas concerning the meaning of
   education--especially in the first two years in college;

9. more often demonstrated a high degree of resourcefulness
   in meeting new situations;

10. did not differ from the comparison group in ability to
    plan their time effectively;

11. had about the same problems of adjustment as the
    comparison group, but approached their solution with
    greater effectiveness;

12. participated somewhat more frequently, and more often
    enjoyed appreciative experiences, in the arts;

13. participated more in all organized student groups
    except religious and "service" activities;

14. earned in each college year a higher percentage of
    non-academic honors (officership in organizations,
    election to managerial societies, athletic insignia,
    leading roles in dramatic and musical presentations);
The graduates of the most experimental schools were
strikingly more successful than their matches. Differences
in their favor were much greater than the differences
between the total Thirty Schools and their comparison group.
For these students, the differences were smaller and less
consistent than the total Thirty Schools and their
comparison group. (p. 14[?])



Other work has been carried on in an effort to develop
evaluation alternatives. These efforts can be classified as
follows:

1. Different 'Standardized' Tests. A testing reform which
has been proposed is to move from 'norm-referenced' tests to
'criterion-referenced' tests. In criterion-referenced tests,
items are not correlated with some other scale of what
children do on these tests with some standardization which
simply compares children with each other. Instead, items are
correlated with actual ability to carry out some task. A
norm-standardized test can tell you where a child stands
relative to the rest of the population that was used for the
norming procedure on that test item; a criterion-referenced
test can tell whether a child can do something that has been
correlated with that item. In principle this idea is good,
and, in fact, if carefully done, it can lead to a much more
satisfactory approach to testing strategies. But there are
some serious difficulties. The ultimate in
criterion-referenced tests is doing the task itself: if you
want to know whether a student can repair a car, you have
her repair the car. But, of course, the whole idea of
standardized tests is to substitute some simple, easily
reproducible and generalizable activity for the things you
really want to test for. The more complex the activity that
you want to evaluate, the harder it is to make a reliable
criterion-referenced test. This is reflected in the fact
that many tests that are reported to be criterion-referenced
leave some question about the relation between what is
tested and the activity, or, more commonly, have defined a
trivial activity, or one that only has reality in the world
of tests, as the criterion that has been used as a
reference.

It has long been a standard procedure to have 'lab' exams in
experimental science subjects. Many biology students
remember vividly the difference between recognizing a
drawing of a microscopic object on a paper and pencil test
and identifying it under the microscope in a practical exam.
Much of the knowledge that children gain in school is of the
practical, hands-on type, and could be tested accordingly.
It is particularly inexcusable that science learning is
evaluated almost exclusively by paper and pencil tests which
essentially measure reading ability and little more. Even
the definitions that are so prevalent in the science
portions of standardized tests usually measure only two
things--whether the student can read the name of some
scientific object or principle, and whether the student can
associate that with a related term. Neither of these skills
covers a significant fraction of what could be considered
scientific literacy. Also, the line drawings which accompany
many test items for younger children are only a partial
substitute for naming; they are highly stylized and symbolic
representations, not even photographs.

Some research groups have substituted objects, photographs,
diagrams, and manipulative materials for paper and pencil
test problems. This makes it possible to discover a number
of things about children's abilities independent of their
reading skills. First, a child who understands the principle
of an electric circuit can light a bulb if given the proper
materials even if she could not answer a written question
about the subject. Secondly, using materials tells you
something about the way a child goes about a problem. Are
groups of objects simply enumerated or are subgroups added
or multiplied? The actual way a child manipulates material
informs the observer about the approach used much more than
any particular answer on a scored sheet. Most people who
bother to do this kind of work with children usually come
away profoundly impressed with the limited notions they have
of how children think and learn. This type of problem is
just as objective as any paper and pencil test, or at least
it can be made just as objective.

In a recent ambitious evaluation effort (Comber and Keeves,
1973), hundreds of thousands of students in 19 countries
were given extensive standardized science tests in order to
assess science education on a global scale. The extensive
technical document which reports the results

Materials can be used to make possible open-ended

discovery, and work with a wide range of materials had any
appreciable effect on the children, she carried out a study
in which she took two groups of children: those who had had
exposure to the APSP course (a materials-rich, manipulative
science program) and those who had had only traditional
education in school. She simply placed them in a room with
lots of material (not the same materials used in the APSP
course), and watched what happened. She noted that the test
group--those who had been exposed to APSP--were more
inquisitive, did more things, more connected and sequential
things, asked more questions of their environment and used
it more adeptly than a group of children who had not been so
exposed. It should be relatively easy to extend this
approach to evaluation to the day-to-day life of American
schools.

In this approach, the observer isn't certain before doing
the work just what behavior will occur in the experimental
children. It is an open-ended evaluation: an effort to say,
"let's see what these children do." In this sense, it is an
application of the most sensitive and sensible evaluation
strategy of any: one of a number of activities based on the
approach of the 'clinical interview'. Obviously, the only
way that we can ever measure the new or novel things that
children do is to have an assessment instrument that leaves
room for observing new and unexpected behavior. This
requires both the input of enough material from the observer
to give the child something to work on, and enough freedom
on the part of the respondent to take advantage of it. The
style represented by the Piagetian interview, of finding out
'where children are at', is perfect for this approach. Using
this same approach it is also possible to find out where
groups of children are with respect to certain concepts, or
types of problems, or styles of knowledge.

Deborah Meier's revealing study about children's responses
to the MAT is an example of the use of a clinical interview
to find out what children know. In this case, the material
of the evaluation was the standardized tests which the
children worked on. By talking with them, it was possible to
find out a great deal about their knowledge, assumptions,
frames of reference, etc.

3. Check lists for teachers to guide them in evaluating
children's learning are powerful evaluative tools. Some
lists are available to cover reading achievement, math
skills, and science knowledge. Lists of this sort have a
tremendous flexibility of use (although they are also
subject to the danger of overly rigid application), they do
not require elaborate test administration procedures, they
can be individually applied and they provide information
directly to the teacher. One big difference between check
lists and more formal tests is that they usually are not
considered total descriptions, but guides. In fact, if they
get too detailed, they become less useful. A list of reading
accomplishments need not cover every technical detail of a
child's reading mastery, but it will give a teacher a sense
of where that child has arrived at and what the child needs
help on. At the same time, it serves to remind a teacher of
skills or parts of a process that may be missing from a
child's repertoire. An example of such a diagnostic,
open-ended reading guide is presented in Evaluation
Reconsidered (Norris, 1973).

4. Record Keeping. A classic means for evaluating children's
growth and development is some systematic recording of facts
or events that involve them. This is the basis of any
sensible evaluation of children. It is a method that all
parents use informally. We observe our children, note the
changes they undergo, and judge their development on the
basis of these changes. It is fairly easy to note major
differences with a small number of children, so most parents
don't keep records of when their children first walk or talk
or perform certain intellectual feats. In a school, where
there are more children per adult and the adults concerned
with the children change from year to year, more formal
records are necessary. The problem is that most schools keep
rather dull and not very useful records: most often some
adult assessment of the general level of the child and a
compilation of standardized test scores. The anecdotal
records are usually spotty and incomplete, while the
standardized reading scores are simply not helpful, even on
a cumulative basis.

Used more imaginatively, record keeping has vast
possibilities for assessing the growth of children. The work
of the Bureau of Educational Experiments, founded in 1911
(recently reprinted), contains explicit discussion of
efforts to assess children's growth through record keeping
before World War I (Winsor, 1973). In the pioneering reform
movement in the Vienna school system between the World Wars,
report cards were abandoned and, instead, each child was
given an elaborate form which recorded aspects of her
social, intellectual, and emotional development (Papaneck,
1962).

A more contemporary, extensive and thoughtful effort of
documenting children's growth and development has been
carried out for nearly a decade by Pat Carini (1973) of the
Prospect School, North Bennington, Vermont. By keeping a
variety of records, she and her colleagues have amassed an
impressive amount of revealing information both about
general aspects of children's growth and specific
information which is helpful about particular children.
Included among these are:



Children's work: e.g., drawings, photos, etc.
Children's journals (generally only for children
    aged 11 and older)
Children's notebooks and written work
Teacher's weekly records
Teacher's reports to parents
Teacher's assessment of children's work in math,
    reading, activities
Curriculum trees
Sociograms



Records are another 'objective' form of evaluation, and the
longer they are kept, the more objective they become. A
single estimate of how much time a child spends in math
activities may be way off, but 10 such estimates in a month
probably average out fairly close to a correct figure. One
component of any successful record-keeping activity is
longevity. Almost any measure or record becomes interesting
and able to tell you something if you keep it long enough.
Historians have long ago learned the power of such
apparently 'trivial' data as vital statistics when available
over long periods of time.
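
The point about repeated estimates can be illustrated with a
small sketch. The figures below are invented purely for
illustration (they are not data from any study cited here):
ten hypothetical daily estimates of the minutes a child
spends on math, with an assumed true figure of 30 minutes.

```python
from statistics import mean

# Hypothetical figures, assumed only for illustration: the child is taken
# to spend 30 minutes a day on math, and an observer makes ten rough
# estimates over a month, some of them far off.
TRUE_MINUTES = 30
estimates = [45, 20, 35, 25, 40, 15, 30, 38, 22, 32]


def error(value, truth=TRUE_MINUTES):
    """Absolute distance between an estimate and the assumed true figure."""
    return abs(value - truth)


# How wrong the worst single estimate is.
worst_single_error = max(error(e) for e in estimates)

# How wrong the average of all ten estimates is.
average_estimate = mean(estimates)
average_error = error(average_estimate)

print(f"worst single estimate is off by {worst_single_error} minutes")
print(f"average of ten estimates is {average_estimate} minutes")
print(f"average is off by {average_error:.1f} minutes")
```

With these assumed numbers the individual errors run in both
directions and largely cancel, which is the sense in which
records kept over time become more 'objective'. The
cancellation would not occur if every estimate were biased
in the same direction.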

Of course, the establishment of a record-keeping system is
not an easy task. Who does the work, who stores the records,
who looks at them, what do you record, when, how, etc.? All
these are questions that have to be addressed; then someone
has to see to it that whatever procedure is adopted is
maintained consistently for long enough so that information
can be drawn from it. But this sort of evaluation has proven
to be an extremely useful way to know what children are
doing, what they are capable of, and the areas in which they
need help. Records also provide invaluable information for
program evaluations.



EVALUATION AT THE PROGRAM LEVEL

The whole field of evaluation is much larger than the
concern for the evaluation of individual children's growth
and development. To the extent that the present methods used
in the schools to measure children's achievement are
inadequate, this inadequacy is magnified at all other levels
of evaluation. The public schools simply do not have
thorough unbiased methods developed within their setting for
systematically knowing and recording children's development
and progress, and what the next best steps for them might
be. Also, the public schools have not developed adequate
systems to support teachers making day-to-day decisions
about the best opportunities to provide for children. The
present system, with its tabulations and aura of
objectivity, simply permits administrators to feel they know
what is happening and can make rational decisions. A number
of schools follow the barbarous custom of posting the
standardized achievement test scores in the principal's
office by grade and teacher, so that the teachers can all be
compared in terms of the results and so that, presumably,
they will have an incentive to raise the standing of their
class. It is certainly the case in many schools that
teachers believe, with good reason, that their future salary
increments and promotions depend on these results. The test
system therefore becomes yet another competitive situation
in the schools, with higher scores becoming the production
goal, like Stakhanovite practices in Russian factories under
Stalin.

Yet is it not usual for descriptions of programs, statements
of educational aims, and official instructions to personnel
to include broader goals than simply the attainment of
certain scores on children's achievement tests? These goals
may cover affective growth, social situations, interest,
personal growth, and a wide range of other issues. If these
larger issues are taken seriously, then a wider range of
evaluation strategies must be employed. Where this
imperative has been recognized, every conceivable activity
has been used at one time or another to assist in making
judgments, including interviews, questionnaires, other
psychometric tests, cost analyses, communities' reactions,
hunches and political considerations. Because the range of
activities that may be involved in the wider range of
evaluation situations is so broad, no specific critique is
possible.

Often the political situation is such that, even though the
funds available are not sufficient for a thorough analysis,
a 'formal' evaluation must be carried out. As there is no
easy approach, some hodgepodge of activity is thrown
together and called evaluation. It is in these instances
that it becomes transparently clear that the so-called
objective evaluation is preoccupied more with political and
social issues than methodological ones. Of primary concern
are questions about who wants particular programs, about
what their benefits are on the basis of broad social terms,
about what people have to gain or lose by the implementation
of a program or by the hiring of a teacher or of a
superintendent, etc. This is not to deny that a considerable
body of data, measurement, and material can be relevant to
decision making and should be gathered and used as much as
possible. Rather, it is to say that there are no totally
objective approaches to decision making, as it involves
people's most basic beliefs, prejudices, and feelings.

In summary, to improve the situation of evaluation
in American schools, two things need to be accomplished.
First, the scope of what is considered evaluation has to
be vastly broadened, and this work has to become an
integral part of the educational experience. Evaluation is
judgment, and to make judgments the relevant information
must be assembled. It is foolish to limit what is
measured and recorded about children or programs to those
few bits of data that happen to be available from present
standardized achievement tests. If evaluation is looked
at from the point of its relation to the rest of the
educational program, one can recognize how separate the two
are at present. It becomes especially clear that children
are hurt and discouraged by the present system, while
teachers are simply not assisted in their difficult tasks.

Secondly, judgments of evaluation are part of an
all-encompassing political-social atmosphere. One cannot
expect that the formal part of evaluation will deviate
very far from the more general, informal judgments that
are meted out by the overall society. If the society
decides that black children are not as worthy as white
children, or that girls are inferior to boys, then the formal
evaluation system will either reflect this judgment or
its results will be ignored. We can only hope to bring
about major changes in the ways in which evaluation is
carried out at the same time that we bring about major
changes in the structure of education and in the society
as a whole.